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Abstract 

Background: Accumulation of gene mutations in cells is known to be responsible for tumor progression, driving it 
from benign states to malignant states. However, previous studies have shown that the detailed sequence of gene 
mutations, or the steps in tumor progression, may vary from tumor to tumor, making it difficult to infer the exact 
path that a given type of tumor may have taken. 

Results: In this paper, we propose an effective probabilistic algorithm for reconstructing the tumor progression 
process based on partial knowledge of the underlying gene regulatory network and the steady state distribution of 
the gene expression values in a given tumor. We take the BNp (Boolean networks with pertubation) framework to 
model the gene regulatory networks. We assume that the true network is not exactly known but we are given an 
uncertainty class of networks that contains the true network. This network uncertainty class arises from our partial 
knowledge of the true network, typically represented as a set of local pathways that are embedded in the global 
network. Given the SSD of the cancerous network, we aim to simultaneously identify the true normal (healthy) 
network and the set of gene mutations that drove the network into the cancerous state. This is achieved by 
analyzing the effect of gene mutation on the SSD of a gene regulatory network. At each step, the proposed 
algorithm reduces the uncertainty class by keeping only those networks whose SSDs get close enough to the 
cancerous SSD as a result of additional gene mutation. These steps are repeated until we can find the best 
candidate for the true network and the most probable path of tumor progression. 

Conclusions: Simulation results based on both synthetic networks and networks constructed from actual pathway 
knowledge show that the proposed algorithm can identify the normal network and the actual path of tumor 
progression with high probability. The algorithm is also robust to model mismatch and allows us to control the 
trade-off between efficiency and accuracy. 



Background 

The construction of gene regulatory networks is an extre- 
mely difficult problem owing to their complexity and the 
difficulty of obtaining the relevant time series data, in 
terms of sampling rate, measurement accuracy, and 
quantity. For instance, microarray data usually come in 
samples much too small for accurate inference, have a 
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very low sampling rate relative to most cell signaling, 
measure average transcript across many cells, and are 
susceptible to many confounding factors which adversely 
affect the signal-to-noise ratio. In particular, for human 
cells, with data coming from patients, there are no time- 
course data and the data come from cells that have 
already undergone a sequence of mutations, so that the 
regulatory mechanisms of the original cell are no longer 
intact. Rather than depend on expression data, one can 
use known pathway information to construct regulatory 
relations and thereby develop an uncertainty class of 
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networks whose regulatory dynamics are consistent with 
the pathway knowledge. An algorithm for doing this has 
been developed in the context of Boolean networks [1]. If 
one could obtain wild-type time-course data, then one 
could reduce this uncertainty class by standard Boolean 
network inference methods. Given that in practice we 
usually only have access to stationary patient data and 
that the progression of mutations leading to the cancer- 
ous state has already occurred, we would like to use the 
available data to reduce the uncertainty class. In fact, 
since all we require is that we have an uncertainty class 
to begin with and wish to use the tumor data, from an 
algorithmic perspective it does not matter whether the 
uncertainty class arises from prior biological knowledge, 
wild-type data, or a combination of both. The proposed 
algorithm operates on the basis of probabilistic sequential 
fault-detection, which views regulatory alterations, such as 
gene mutations, as faults in the network wiring [2]. It esti- 
mates the sequence of faults leading to the current (can- 
cerous) regulatory structure, and from these estimates, 
a reduced uncertainty class for the original (healthy) net- 
work. By taking this approach the algorithm simulta- 
neously accomplishes a dual purpose: network inference 
and fault detection. 

The methodology is based on certain fundamental 
notions regarding cancer development, in particular, that 
the formation of a tumor is a complex process usually 
proceeding over a period of decades. Normal cells evolve 
into cells with increasingly neoplastic phenotypes. Tumor 
progression is driven by a sequence of randomly occur- 
ring mutations and epigenetic alterations of DNA that 
affect the genes controlling cell proliferation, survival, 
and other traits associated with malignant cell phenotype. 
To wit, tumor progression is a multi-step process of 
changes in the regulatory pathways. A set of pathways 
must be deregulated during the tumor progression until 
the tissue reaches a cancer state. A wide variety of normal 
adult human cell types can be transformed experimentally 
by perturbing five pathways [3]. Certain normal human 
cells require a greater or lesser number of changes before 
they will become transformed. Moreover, the regulatory 
pathways can be altered in many different ways leading to 
the same cancer. For instance, studies on colon cancer 
show that the great majority (~ 90%) of colon carcinomas 
suffer inactivation of the APC gene on Chromosome 5q 
as an early step in this process, about 40% to 50% acquire 
a K-ras mutation, 50% to 70% show an LOH of Chromo- 
some 17p involving p53, and about 60% show an LOH on 
Chromosome 18q. Most colon cancers will therefore 
begin with a Chromosome 5 alteration, but then will take 
alternative genetic paths on the road toward full-fledged 
malignancy [3]. 

In sum, although some common alterations may happen 
in tumor progression, different patients confront with 



different types of alterations during the progression, 
thereby making it important to find a way to identify 
mutations in order to apply appropriate intervention. Very 
little work has been done on the identification of genetic 
alterations (e.g. mutations) in cancer progression using 
network models. One such example is the work by Ger- 
stung et al. [4], where they predicted cancer progression 
by applying a conjunctive Bayesian network, in which the 
order of gene mutations is extracted. 

In this paper, we use the Boolean networks with pertur- 
bation (BNp) framework to model signaling pathways and 
ultimately predict the gene mutations that occurred during 
the tumor progression process. Boolean networks (BNs) 
have been used in a variety of other contexts and with dif- 
ferent objectives in biological applications. Kauffman [5] 
proposed that the cell types are the attractors. He intro- 
duced randomization into the networks, in terms of envir- 
onmental noise (random perturbation of individual genes) 
and mutation (not to be confused with the notion of 
mutation in cancer progression), which refers to changes 
in the wiring of the network. Random BNs and their char- 
acteristics have been extensively studied by Aldana et. al 
[6]. In a random BN, the average function in-degrees are 
constant and function outputs are assigned randomly. 
Serra et al. [7] investigated the effects of perturbation in 
the context of random BNs by knocking out a single gene. 
Additionally, intervention in BNp has been also studied by 
Dougherty et al. [8] and Qian and Dougherty [9] . 

In this work, we focus on BNs with perturbation owing 
to their role in modeling gene regulatory networks, a key 
point being that their dynamics can be modeled as Markov 
chains, thereby facilitating the modeling of genetic altera- 
tions in signaling pathways by shifting the network steady 
state distribution (SSD) from the normal (healthy) SSD 
toward the cancerous SSD. Having this tool in one hand 
to model signaling pathways, and the cancerous SSD 
extracted from the malignant tissue (e.g., based on gene 
expression data) on the other hand, one can test all the 
possible alterations on the BN satisfying the pathway 
information to see which one makes the SSD of the altered 
BN as close to the cancerous SSD as possible. This allows 
one to track the sequence of mutations. However, there 
are two concerns for using BNps to model signaling path- 
ways: (1) the network perturbation probability should be 
determined, and (2) signaling pathways provide us with 
incomplete information, which means that there may be 
too many BNs that satisfy the pathway information. The 
first issue can be mitigated by finding a good estimate of 
the perturbation probability. For example, inferring a BNp 
from a sequence of gene expression data has been studied 
in [10]. In fact, the second issue is the main source of 
uncertainty in our problem. To the best of our knowledge, 
the paper by Layek et al. [1] is the only work that proposed 
a method to extract the BN underlying the normal tissue 



Esfahani et al. BMC Bioinformatics 201 1, 12(Suppl 10):S9 
http://www.biomedcentral.com/1471-2105/12/S10/S9 



Page 3 of 1 5 



from a set of biological pathways. Although this paper 
introduced an elegant method for extracting the informa- 
tion needed for constructing Boolean networks from bio- 
logical pathways, it yields a large number of networks 
since the available network knowledge is often incomplete 
and not enough to point out the true network. To address 
this issue, we define the notion of a family of Boolean net- 
works, which is the set of all BNs that satisfy the con- 
straints that are imposed by a given set of pathways. For 
instance, for a 6-gene signaling pathway, the resulting 
family can contain 2 12 networks, all of which satisfy the 
constraints imposed by the pathway. 

As mentioned earlier, the main goal of this paper is two- 
fold: (1) to infer the normal network underlying healthy 
cells from this family, and at the same time, (2) to find the 
set of alterations that have occurred during the cancer 
progression. Toward this goal, we propose a probabilistic 
sequential fault detection algorithm that can effectively 
identify the best candidates for the original normal 
(healthy) network and the accumulated gene mutations. 

Methods 

Boolean networks (BNs) 

A Boolean network G ( V, F) is defined by a set V = {x lf 
X 2 ...» X n } of binary variables, x t e {0,1}, i - 1,..., n, and a 
list F = (fi, f 2 ,..., f n ) of Boolean functions. The value of 
Xi at time t + 1 is completely determined by a subset 
{xn> %i2> tXiJ c ^at time t via a Boolean function^ : 
{0, l} ki — » {0,1}. Transitions are homogeneous in time 
and we have the update: 

+ 1) = x i2 {t), x hi (t)). (l) 

Each Xi represents the state (expression) of gene /, where 
Xi = 1 and Xi = 0 represent gene i being expressed and not 
expressed, respectively. It is commonplace to refer to x t as 
the /th gene. The list F of Boolean functions represents the 
rules of regulatory interactions between genes. That is, any 
given gene transforms its inputs (regulatory factors that 
bind to it) into an output, which is the state or expression 
of the gene itself. All genes are assumed to update syn- 
chronously in accordance with the functions assigned to 
them and this process is then repeated. At any time t, the 
state of the network is defined by a state vector x(£) = (xi 
(£), x 2 (t), ...,x n (t)), called a gene activity profile (GAP). 
Given an initial state, a BN will eventually reach a set of 
states, called an attractor cycle, through which it will cycle 
endlessly. Each initial state corresponds to a unique attrac- 
tor cycle and the set of states leading to a specific attractor 
cycle is known as the basin of attraction (BOA) of the 
attractor cycle. 

A Boolean network with perturbation (BNp) is defined 
by allowing each gene to possess the possibility of ran- 
domly flipping its value with a positive probability p. 
Implicitly, we assume that there is an i.i.d. random 



perturbation vector y = (y lt y 2 , y n ), where y t e {0, 1}, 
the ith gene flips if and only if y t = 1, and p = P(y t ■ = 1) 
for i = 1, 2,..., n. If x(t) is the GAP at time t, then the 
next state x(£ + 1) is either f(x(£)) with probability (1 - 
p) n or x(t) 0 y with probability 1 - (1 -p) n , where f is 
the multi- output function from the truth table and 0 is 
component-wise addition modulo 2. Perturbation makes 
the corresponding Markov chain of a BNp irreducible 
and ergodic. Hence, the network possesses a steady state 
distribution, SSD(BNp), describing its long-run behavior. 
A BNp inherits the attractor structure from the original 
Boolean network without perturbation, the difference 
being that a random perturbation can cause a BNp to 
jump out of an attractor cycle, perhaps then transition- 
ing to a different attractor cycle prior to another pertur- 
bation. If p is sufficiently small, then the SSD will reflect 
the attractor structure within the original network. We 
can derive the transition probability matrix (TPM) P if 
we know the truth table and the perturbation probability 
p for a BNp. The TPM of a BNp can be decomposed as: 

P = (l-p) n Q + H, (2) 

where, Q is the TPM of the corresponding determinis- 
tic BN and H is a 2 n x 2 n matrix depending only on n 
and p [11]. 

We assume there exists a "normal" BN, denoted N nor . 
mai, corresponding to a healthy wild-type phenotype, 
and a family BN = {N l , N 2 , ... , N^ m '} of BNs posses- 
sing identical predictor sets as N norma i such that 
N normal G ®N • We refer to this family BN as the 
"uncertainty class" relative to N norma i. 

Given a BN, we define an "alteration" to be a change 
in the rule structure (i.e., truth table). A "path" Path - 
{alti , alt 2 , alt M } is defined as a sequence of M 
alterations. From a modeling perspective, M denotes the 
number of alterations that have occurred during the 
tumor progression and altj refers to the yth alteration. 
We assume that each alteration affects only a single 
gene and no two alterations in the same path affect the 
same gene. The result of applying a path of alterations 
to a Boolean network N is to produce an "altered net- 
work" [N; Path]. If we begin with a normal BN, N normab 
and apply a "cancerous path", Path a we obtain a cancer- 
ous network N cancer =[N norma i; Path c ]. The following 
commutativity and associativity properties follow from 
the definition: 

[N;{alt v alt 2 }] = [N; {alt 2 ,alt l }], 
[N;{alt l ,alt 2 },alt 3 ] = [N;{alt l ,alt 2 ,alt 3 }]. 

Alterations in cancer progression are commonly gene 
mutations, and the accumulation of gene mutations is 
usually responsible for cancer. Gene mutation includes 
both oncogene activation and tumor suppressor gene 
deactivation, resulting in either continuous activation or 
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deactivation of the corresponding genes. In the context 
of the BN model, such alteration in gene x t leads to per- 
manently setting the boolean function to f t ■ = 1 or = 0. 
We denote a gene mutation by a pair (/, k), which indi- 
cates that gene x t is stuck at k e {0, 1}. For convenience, 
we define (/J/j = (/, 0) and (/, 0) = {1, 1) . If a Boolean net- 
work N is altered by a mutation (/, /<), this mutated BN 
is denoted as [N; {(/, k)}]. For example, for a 4-gene 
Boolean network N, [N; {(1,0), (4,1)}] refers to a mutated 
version of N, where gene x x is permanently deactivated 
and gene # 4 is permanently activated. In this case, in the 
regulatory truth-table, we will have f x = 0 and f 4 = 1 for 
every set of predictors. Gene mutation, also called 
"1-gene function perturbation", has been studied [9]. It 
should be noted that, physically, the order of mutations 
can make a difference in cancer progression, since 
alterations affect the regulatory structure, thereby affect- 
ing subsequent cancer progression. There is, however, 
no way to take this into account given that we only 
have steady-state data and no data on transient beha- 
vior. From a mathematical perspective, the commutativ- 
ity in (3) means that a path is a set of alterations rather 
than a sequence of alterations; however, we employ the 
latter terminology owing to its commonplace usage. 

Now the problem to be addressed in this paper can be 
stated as follows: Given a family BJ\f of Boolean net- 
works, the steady state distribution (SSD) of the cancer- 
ous network, and the number of alterations M, what are 
the best candidates for N norma i and Path c ? Searching for 
the best candidate for the normal network involves esti- 
mating the distance between altered networks and the 
cancerous network. Since the only available information 
about the cancerous network is its SSD, we need to 
define a distance measure between two networks based 
on their SSDs. Given two BNs with perturbation N l p 
and N J p , we compute their distance as follows: 

D^ND^ptK^K^ (4) 

where n { = SSD(Nj,) , n j = SSD{N j p ), and p(TT i ,TT j ) is 
the Kullback-Leibler divergence (KL-divergence) 
between the SSDs and tt ; . This distance measure can 
be extended to BNs by first building the corresponding 
BNp for each BN using (2) and a given probability of 
perturbation p and then computing the distance 
between the resulting BNps. Without any ambiguity, in 
what follows, we use the same notation for a BN and 
the induced BNp for notational simplicity. 

Gene mutation effects 

Effects of gene mutation in a BNp 

In this section, we study the effect of a gene mutation (/, 
k) on the TPM of a BNp and its SSD. Gene mutations 
affect only the regulatory matrix Q in (2), where the 
mutation of each gene can be modeled as a multiplicative 



perturbation. Thus, for every mutation (/, k), we can find 
a corresponding transformation matrixT iyk such that the 
TPM of the altered BNp is given by: 

P = (1 - p) n QT iik + H = P + (1 - p)"Q(T a - 1), (5) 

where I is a 2 n x 2 n identity matrix. Based on this 
observation, we can easily prove the commutativity 
property shown in (3) (see Additional file 1 for the 
proof). According to the associative property shown in 
(3), a sequence of multiple gene mutations can be repre- 
sented by a single transformation matrix, which is a pro- 
duct of the transformation matrices, each corresponding 
to a single gene mutation in the sequence. For example, 
the TPM of a BNp altered by a threefold mutation, [N; 
{(h> h\ (h> k 2 ), k 3 )}], is given by: 

P = P + (l-p)"Q(T fi/fe T f2/fe2 T f3/fe3 -I). 

The effect of rank- one perturbations in the TPM of a 
Markov chain on the SSD has been studied in the con- 
text of structural intervention in gene regulatory net- 
works [9], and more generally in the framework of 
Markov chain perturbation theory [12]. We can utilize 
these results to analyze the SSD of the altered BNp, 
whose TPM is given by (5). 

In order to see how existing work on Markov chain 
perturbation can be used to analyze the effect of gene 
mutations on the SSD, consider two TPMs P and p 
that arise from the original network and the altered net- 
work, respectively. Let tt and ^ be the SSDs of the two 
networks, such that tt t P = if and ~ r p _ ~ r . We can 
find the analytical expression of the change in SSD 
using the fundamental matrix Z = [I - P + en T ] _1 , 
where e is an all-one column vector [13]. The funda- 
mental matrix Z exists for any ergodic Markov chain. 
Consider a rank-one perturbation, where the TPM of 
the perturbed Markov chain is p = p + ab r > where a, b 
are two arbitrary vectors satisfying b T e = 0, and P is the 
TPM of the original Markov chain. In this case, it can 
be shown that [14] the following is true: 

K T =K T + b T Z. (6) 

1 - b r Za 

Now, by representing the change of TPM due to a 
gene alteration as a sequence of rank-one perturbations, 
we can utilize (6) to predict the overall effect of the 
given mutation on the SSD of the network. To be more 
precise, suppose the BNp at hand undergoes a single 
mutation, (/, /<). The transition probability matrix p of 
the mutated BNp can be represented as follows: 

u 

P = P + (1 - p)"Q(T iife - 1) = P + (1 - p)» ■ £a r b), (7) 

i=i 
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for some vectors a ; and by satisfying b ; • e = 0 , and a 
positive integer u < 2 n ~ \ The proof can be found in Addi- 
tional file 1. Based on (6) and (7), the SSD of the altered 
BNp can be iteratively calculated in at most 2 W_1 iterations. 
Example: effect of mutations in a 3-gene network 
For illustration, let us consider a simple 3-gene BNp. 
Suppose the BNp is altered by (3,0), which means that 
the gene x 3 is permanently deactivated such that x 3 = 0. 
As a consequence, there cannot be any destination state 
in Q, which is the deterministic part of the TPM P in 
that arises from the regulatory structure of the BN, that 
corresponds to x 3 = 1. Hence, if we let Q = [qi... q 8 ], 
where q ; is the jth column in Q, the corresponding col- 
umns in Q should be shifted as follows: 



qi ^q 2 > 
q 5 ^q 6 ' 



q? ^qs' 



where q ; corresponds to the destination state with deci- 
mal representation / - 1, and q ; <^q t means the jth col- 
umn should be updated to q, + q ; and the /th column to 
0. Therefore, we get the following transformation matrix: 



£ 3,0 



1 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


1 


0 



and we have: 

Q(T 3 , 0 -i) = [q2 -q 2 q 4 -q 4 q 6 -q 6 q 8 -q 8 ] (8) 

Note that the rank of Q(T 3>0 - I) is at most 2 (w_1) = 4. 
Now, (8) can be decomposed as follows: 



-<l6 <l8 



" 1 " 




" 0 " 


-1 




0 


0 




1 


0 




-1 


0 


+ CJ4 


0 


0 




0 


0 




0 


0 




0 



(9) 



where b ; - • e = 0 for all b/. From (9) and (5), we get: 

4 

P = P + (l-p) 3 £a,b,, 



(10) 



1=1 



which is in the form shown in (7). Now, by utilizing 
the result in (6), we can analytically compute ^ through 
sequential rank-one perturbations as follows: 



rw + Ci-p) 3 ^)^ i= + - 1 h x z x 

1 - D l /j 1 a 1 

P 2 = Pi + (1 - P) 3 (a 2 b 2 ) -> ~ 2 = ~i + 1 H2 b 9 Z 9 



1 b 9 



(11) 



P 3 =P2+(l-pr(a 3 b 3 )-> 3 = 2" 



P = P 4 = P 3 +(l-p) 3 (a 4 b 4 )- 



1 b oZ oS a 



-b,z, 



1 - b 4 Z 4 a 4 



b 4 Z 4 



where n and n ] are the SSD vectors for P and p ; , 
respectively, which satisfy 7r r P = tt t and ttJP; = n J . Z ; 
are the corresponding fundamental matrices, as defined 
earlier. 

Overview of the proposed algorithm 

Suppose we are given a family BJ\f of Boolean networks 
that contains the normal network N normah Based on our 
definition, the cancerous network can be written as: 

N cancer = [ N normal'' Path A = [ N normal' W^c,V • •> "Km) 1 • 

Let SSD cancer denote the SSD of the cancerous net- 
work N cancer . We define the residual value for a given 
Boolean network N as: 



R{N):= {SSD cancer ,SSD{N p )), 



(12) 



where N p is the BNp with the regulatory matrix Q 
determined by the Boolean network N and the perturba- 
tion probability p. At each step, the algorithm alters the 
networks in the current family of networks through a sin- 
gle mutation. After the alterations, the algorithm keeps 
only those networks that lie within a certain distance 
from the cancerous network, where the distance is com- 
puted by (12). For the selected networks, the algorithm 
also keeps a record of the alterations that leads to these 
altered networks. Figure 1 provides an illustrative over- 
view of the algorithm. Suppose that initially, the network 
family BJ\f ={N 1 ,N 2 ,N 3 } contains three 3-gene net- 
works. We assume that N 1 is the true normal network 
N normal* an( ^ the cancerous network N cancer is obtained by 
taking the cancer progression path Path c = {(1, 0), (2,1)}, 
hence N cancer = [N normal ; {(1,0), (2,1)}]. Given the family 
BJ\f > the SSD of the cancerous network SSD cancer , and 
the number of mutations (set to M - 2 in this example), 
the algorithm tries to identify the best candidates for the 
normal network N norma i and the path Path c that may lead 
the original network into the cancerous network in two 
steps. 

Initially, for each network jyi E BM > there can be 6 
possible altered networks based on a single gene muta- 
tion. These altered networks are shown in the first row of 
Figure 1, in the middle plot. Among these networks, the 
algorithm only selects the networks whose SSDs are close 
to SSD cancer . Suppose we select the altered networks that 
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First iteration: 



First Mutation 




Network Selection 





□ :7V 1 
A :N 2 
O :N 2 

$JV X 



□ :[i\P;(/,jt)];/ = 1,2,3;* = 0,1 
A :[7V 2 ;(/,£)];/ = l,2,3;£ = 0,l 
O :[7V 3 ;(/,£)];/ = l,2,3;£ = 0,l 



Second Mutation 




Network Selection 




■ :[#';(/,*)];(/,*) = & 9M2.1), 0,0) 
A :[N 2 ;(l,kMhk) = (1,0) 




Second iteration: 



□ :[7V 1 ;{(l,0),(/^)}];(/^)^(l,0),(l,l) 

■ :[iV 1 ;{(2,l),(/^)}];(/^)^(2,0),(2,l) 

■ :[^ 1 ;{(3,0),(/,*)}];(/,*) 5 t(3,0),(3,l) 
A:[iV 2 ;{(l,0),(/^)}];(/^)^(l,0),(l,l) 

Figure 1 Illustrative overview of the algorithm. Sequential fault detection algorithm for a family fij\f that consists of three 3-gene Boolean 
networks (depicted as a square, triangle, or circle). Suppose N 1 is the (unknown) normal network that was altered into the cancerous network 
through M = 2 mutations. In the first row, all possible single mutations are applied to all networks, where the resulting altered networks are shown 
in the middle. The algorithm keeps only those networks whose residual value (the distance to the cancerous network) is less than p h resulting in a 
reduced family fij\f 1 . In the next step, we consider all possible single gene mutations to the networks in fij\f 1 , as shown in the middle of 
second row. The algorithm keeps only those networks whose residual value is less than p 2 (<Pd, which leads to a further reduced set fij\f 2 



I : [TV 1 ; (/, k)] ; (/, k) = (1, 0), (2, 1), (3, 0) 



A :[7V 2 ;(U)];(U) = (1,0) 



[N\{(1,Q),(2,1)}] 

[N 1 ;{(2,1),(1,0)}] 
[7V 2 ;{(1,0),(2,1)}] 



are within /^-distance of cancerous network. The 
selected networks constitute a new (and reduced) uncer- 
tainty class of networks 1 . Next, each network in 
fij\f 1 can be altered into 4 different networks based on 
an additional single gene mutation. These networks are 
shown in the second row of Figure 1, in the middle plot. 
Among these networks, the algorithm selects only those 
that are within /3 2 -distance from the cancerous network, 
resulting in a further reduced uncertainty class of net- 
works fij\f 2 . The family 2 contains the best candi- 
dates for the normal network and the cancerous path. 
For example, @j\f 2 in Figure 1 contains two candidates 
(N 1 and N 2 ) for the normal network. For N 1 , the cancer- 
ous path {(1,0), (2,1)}, and equivalently, {(2,1), (1,0)}, may 
lead it to the cancerous network with the given steady 
state distribution SSD cancer . Similarly, N 2 may be another 
candidate for the normal network, which may get close 
to the cancerous network also through the path {(1,0), 
(2,1)}. Note that the actual number of networks in gj\f 1 
and that in 2 will depend on the parameters /?i and 
/3 2 , respectively. 



Details of the proposed algorithm 

Algorithm 1 Fault Detection Algorithm 

Input : Family of BNs, Cancerous SSD, Number of mutations (i.e. M), perturbations probability p cancer 
Output : Set of all network-path pairs BM M 
Initialize : BAT 0 =BN = {N 1 , . . . , N |fW 1} 
Alt[ ={alt v alt 2 ,...,alt 2n _ 1 ,alt 2n },l = l,...,\BAr \ 
for m = 1 to M do 
BM m = ,c = 0 
for 1 = 1 to | BM m_1 1 do 
N <— N ; e BM m ~ l 
for Maltj e Alt' m do 
N <-[N;af£j] 
if R[N) < m then 
c <- c + 1 
N c 

BM m <- BM m U {N c } 
end if 

Alt c m+1 <- Alt c m - {alt j, alt j] 
end for 
end for 
end for 

return BM m 



The detailed procedure of the proposed algorithm is 
shown in Algorithm 1. At each step, one additional single 
gene mutation is considered. Therefore, to detect all M 
alterations that may have led the normal network into the 
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cancerous network, the algorithm needs to go through M 
sequential steps. In the first step, we consider all possible 
single mutations of the form (/, k) for every network N in 
the family BJ\f > which results in 2n\BN | altered net- 
works. Among these networks, we select only those net- 
works whose SSD can get close enough to SSD cancer , after 
M - 1 additional gene mutations. Based on this criterion, 
we select the network-mutation pairs, whose residual 
values (distance to the cancerous network measured based 
on SSD) are smaller than a threshold /}]_. In the second 
iteration we start with a new family of BNs gA/" 1 that 
contains the networks selected in the previous iteration. 
Since the gene that was mutated in the first step cannot 
go through another mutation, every network in fi]\f 1 can 
go through one of 2(n - 1) possible single gene mutations. 
Among these possible altered networks, we select only 
those networks whose residual values are smaller than a 
threshold /? 2 - After repeating these steps M times, the final 
class fi]\f M will contain the best network-path pairs, 
where each pair consists of a candidate for the normal net- 
work and the cancerous path that may drive the given net- 
work into the cancerous state with the given SSD. In each 
iteration, the threshold /? ; can be used as a control para- 
meter for trading between efficiency and accuracy. In gen- 
eral, we will have p 1 > p 2 > ... > Pm- 
Performance metrics 

In order to evaluate the performance of the algorithm, we 
define two metrics. The first metric is the probability that 
the algorithm will miss the true normal network N norma i 
and the actual cancerous path Path c of length M: 

P miss = Pr([N„ ormfli ; Path c ] £ BAf M ). (13) 

We can estimate this probability as follows. Let us 
define: 

p. : = F D lM<p ^\Pi) t Vi = 1, . . . , M, (14) 

where Fj} M ~ l >P cancer \d) is the cumulative distribution 
function (CDF) of the distance d between a BNp (with 
the perturbation probability p ca ncer) an d its altered ver- 
sion obtained by (M - /) mutations. Estimation of this 
CDF will be further discussed in the next section. Now, 
if we define: 

= (i - p ;+ i) M -'(i - p ( 1s) + p£L; i si <m - 1 (i5) 
p!1 = (i-Pi) M ' 

we can show that: 
P • ~p [M) (16) 

1 miss rmiss' v ' 

The proof can be found in Additional file 1. The sec- 
ond metric to be used is the probability that the 



algorithm will not be able to detect any network within 
g -distance of the true normal network N norma i : 

Pmiss,e ~ 

Pr(^N e BAf M : D(N, N nmmal ) <e), (17) 

These two metrics can be used to evaluate the accu- 
racy of the proposed algorithm. 

It would be also interesting to evaluate the computa- 
tional complexity of the algorithm. When performing an 
exhaustive search, the total number of residual value 
computations that would be needed to find the final 
network family fij\f M would be: 

i=0 

which is exponential with respect to the number of 
mutations M. 

Now, suppose that in the /th iteration of the proposed 
algorithm, a,-% of the networks are selected (i.e., 
\BM { [= 2{n - i) i | BM i_1 1 ) by controlling the para- 
meter p t . For finding $j\f M , our algorithm would need: 

M-l i 

^\BAf (Q(2(n-j) ,) (18) 

i=0 j=0 

residual value computations, where a 0 = 1. A smaller 
Pi will lead to a smaller a b thereby reducing the overall 
complexity of the algorithm. However, this will also 
increase the probability of missing the true network, 
hence the parameters /? ; can be used to control the 
trade-off between computational efficiency and the pre- 
diction accuracy of the algorithm. As we can see from 
(18), the computational complexity of the proposed 
algorithm is polynomial with respect to the number of 
genes n (for a fixed M), while it is exponential with 
respect to the number of mutations M (for a fixed ri). 
However, the parameters a } (j = 0, ... , M - 1) allow 
one to trade between computational efficiency and pre- 
diction accuracy. As a result, the proposed algorithm 
can accurately reconstruct the cancer progression path 
in a much more efficient manner compared to the 
exhaustive search, as will be demonstrated in our simu- 
lation results. 

Cumulative distribution function of the distance between 
a random BNp and its mutated version 

We estimate the CDF of the distance between a network 
and its altered version based on random BNs. We define 
a random Boolean network (RBN) as a BN: (1) whose 
gene predictors are randomly chosen such that every 
gene has k predictors, and (2) the truth table of every 
Boolean function f t follows an independent and 
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identically distributed Bemoulli(p h ) distribution, where 
p b is typically called the bias of the Boolean function 
f t . By allowing random perturbations with probability 
p in the RBN, we can obtain a random BNp (RBNp). 
First, we generate large number of RBNps with certain 
properties. Second, for each RBNp N pt we randomly 
introduce m mutations to obtain an altered network 
Np> and measure their distance D(N p ,Np) . Based 
on these observations, we can estimate the CDF, 
F D [m,p \d) = P(D(N p ,N p ) < d) where m is the number 
of single gene mutations and p is the perturbation prob- 
ability in the RBNp. 

Results and discussion 

Estimating the CDF of the distance between networks 

To execute the algorithm, we first estimated the CDF 
F D ^ m,p \d) of the distance d between an RBNp and its 
mutated copy. As with ensemble analysis in [15] [16] [17], 
we estimate these CDFs based on a large number of 
randomly generated networks with similar structural 
properties. The two most important parameters for gen- 
erating random BNs are their bias probability p b and 
connectivity^ As described earlier, p b is the mean of the 
Bernoulli distribution used to randomly generate the 
predictor function for each gene in a BN, and k is the 
maximum number of input variables for each of these 
functions. We randomly generated 4,000 BNps with 
these properties. For each network, we introduced ran- 
dom gene mutations and computed the distance 
between the original BNp and the altered BNp. We used 
the MATLAB function KSDENSITY to find the CDF 
that best fits the observed distance distribution. We 
repeated the overall experiment for different numbers of 
genes n, different perturbation probabilities p } and dif- 
ferent numbers of mutations m. 

The estimated CDFs are shown in Figure 2, for several 
different parameters. Figure 2- (A) shows the estimation 
results for 6-gene networks for one or two gene muta- 
tions. We can see that the distance increases when we 
increase the number of mutations while keeping the 
probability of perturbation fixed. Similar behavior can be 
observed for 8-gene networks shown in Figure 2-(C). We 
can also see that the distance is generally larger for the 8- 
gene networks. 

As we can see from Figure 2-(B), for 6-gene networks, 
increasing the perturbation probability from p - 0.001 
to p - 0.01 decreases the distance. This is intuitive, 
since gene mutation (see (5)) only affects the regulatory 
part, which plays less important roles as the perturba- 
tion probability p increases. As a result, changing the 
regulatory structure of a BNp will have less significant 
effects when p is larger. Figure 2-(D) shows the results 
for 8-gene networks, which show similar tendencies. 



Performance of the algorithm on synthetic network 
families 

We evaluated the performance of the proposed algorithm 
based on randomly generated families of Boolean net- 
works through Monte Carlo simulations. All random net- 
works in each of these families have identical structural 
properties (i.e., p b and /<). In each family, one network 
whose Boolean functions are canalizing functions was 
selected as the true normal network N norma i. A canalizing 
Boolean function is a function in which an input with a 
specific value determines the output of the function 
regardless of the other inputs. For instance,/^, x 2 ) = 
%iORx 2 is a canalizing function, where x 1 = 1 (and simi- 
larly, x 2 = 1) will make the output f{x lf x 2 ) = 1» regardless 
of the value of the other input variable. We randomly 
chose a path of length M and altered the normal network 
according to the given path to obtain the cancerous net- 
work. The steady state distribution (SSD) of the cancer- 
ous network was computed, to be used as an input for 
the proposed algorithm. Next, we used the proposed 
algorithm to find out whether it was able to infer the true 
normal network from a given family of networks and cor- 
rectly predict the actual cancer progression path, when 
provided with the number of mutations M and the can- 
cerous SSD. In our simulations, we used p can cer = 0.001, 
pb - 0.3, and k - 2. We considered 6-gene networks with 
M - 2 mutations and 8-gene networks with M - 3 muta- 
tions. For the case of 6-gene networks, we considered 
families of size | BAT \ = 2 8 and \BM \ = 2 10 . For the 
case of 8-gene networks, we considered families of size 
\BN | = 2 8 . The algorithm was implemented using 
MATLAB 7.9.0 (R2009b), and all simulations have been 
performed on a desktop computer with 2.67GHz Intel 
Core \7 CPU and 12GB RAM. Each SSD computation 
took around 9.2 x 10~ 4 sec and 5.7 x 10~ 3 sec for n = 6 
and n - 8, respectively. 

Table 1 summarizes the results of applying our algo- 
rithm to 500 randomly generated network families, 
where each family contains \BM \ = 256 6-gene net- 
works and the normal network undergoes two gene 
mutations. The threshold /5 1 was chosen such that 

1 = F^~ p ^~ l {p x ) for different values of p lt The second 
column in Table 1 shows the probability of missing the 
true network defined in (16). The third, fourth, and fifth 
columns show the empirically estimated probabilities. 
The sixth column shows the average number of net- 
works in the final network family fij\f 1 . The seventh 
column shows the average number of cancerous paths 
found in the final step, and the final column shows the 
average number of SSD computations needed for find- 
ing BN 2 ♦ As we can see in Table 1, increasing /?! (by 
controlling p x ) decreases the probability of missing the 
true normal network but increases the number of 
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D(BN, BN P ) D(BN, B~N P ) 

Figure 2 CDF of the distance between a random BNp and its altered version. CDF of the distance between a random BNp and its altered 
version. (A) 6-gene networks for different m (number of mutations) and p (perturbation probability). (B) 6-gene networks with m = 2 mutations 
for different perturbation probabilities. (C) 8-gene networks with different m and p. (D) 8-gene networks with m = 2 mutations for different 
perturbation probabilities. 



networks (and the corresponding cancerous paths) 
included in the final network family 2 . Further- 
more, the number of SSD computations will increase if 
we use a larger fix (by increasing p±). A similar trend 
can be also observed in Table 2, which summarizes the 
simulation results for 200 families with | BAf | = 1024 
6-gene networks. 

We also evaluated the performance of the proposed 
algorithm based on randomly generated network 
families, each of which contains I BJ\f I = 256 networks 



with 8-genes, with M = 3 gene mutations. The experi- 
mental results are summarized in Table 3. The first col- 
umn in this table shows the probabilities p 1 and p 2 that 
were used to choose /5 1 and /3 2 > using (14). The thresh- 
old /? 3 was set to /3 3 = 0.1. Table 3 shows that increasing 
the threshold values result in a higher probability of suc- 
cess (i.e., smaller probability of missing the true normal 
network) but also a higher computational cost, as we 
would expect. In practical situations, the actual pertur- 
bation probability p can cer ma y n °t be exactly known, in 



Table 1 Performance of the proposed algorithm evaluated on 500 randomly generated network families 



\BM 


= 256 


i Pcancer ~ P 


= 0.001, 


& = 0.1 






p . 

'miss 


pemp 
miss 


p emp p emp 

1 rai55,e=0.1 1 miss ,e =0.2 


AVG of 


BN 1 


AVG of # of paths 


AVG of # of SSD calculations 



p ] = 0.1 0.81 0.66 0.58 0.57 24.3 3.71 3,421 

p, = 0.3 0.49 0.45 0.41 0.39 45.11 4.17 4,677 

Pt= 0.5 0.25 0.24 0.21 0.20 64.27 4.41 5,918 

p } = 0.7 0.09 0.06 0.04 0.04 94.84 4.60 9,208 



Performance of the proposed algorithm evaluated on 500 randomly generated network families. Each family contained BM = 256 6-gene networks [k = 
2, M = 2, and p b = 0.3). 
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Table 2 Performance of the proposed algorithm evaluated on 200 randomly generated network families 



\BM 


= 1024 


i Pcancer ~ P 


= 0.001, p 2 = 0.1 




Pi 


p . 

'miss 


pemp 
1 miss 


p emp p emp 

1 miss, ,e=0.1 1 miss ,6=0. 2 


AVG of 


BM 2 


AVG of # of paths 


AVG of # of SSD calculations 



Pi = 0.1 0.81 0.65 0.57 0.56 68.3 4.45 13,289 

P!= 0.3 0.49 0.46 0.40 0.39 138.17 4.70 17,319 

Pi= 0.5 0.25 0.23 0.17 0.17 206.7 4.97 21,846 

P! = 0.7 0.09 0.05 0.03 0.03 293.4 4.98 36,348 



Performance of the proposed algorithm evaluated on 200 randomly generated network families. Each family contained BAf = 1024 , 6-gene networks {k 
= 2, M = 2, and p b = 0.3). 



which case we would have to estimate the probability. 
To evaluate the robustness of the proposed algorithm, 
in the presence of model mismatch, we performed 
another set of simulations, whose results are summar- 
ized in Table 4. We used randomly generated network 
families, each with \BAf \ = 256 6-gene networks, and 
considered M = 2 mutations. As we can see in Table 4, 
there was no significant performance degradation when 
the perturbation probability p used by the algorithm was 
different from the true perturbation probability p can cer- 
The results for families with | BJ\f | = 1024 networks 
are summarized in Table 5, which show that the pro- 
posed algorithm is robust to model mismatch. Finally, 
Table 6 shows the results for network families with 
| BAf | = 256 8-gene networks with M = 3 gene muta- 
tions, which also clearly shows the robustness of our 
algorithm. 

Performance on cancerous networks involving the p53 
gene 

Next, we evaluated the performance of the proposed 
algorithm based on a family of BNs constructed from 
pathways that involve the p53 gene. Tumor suppressor 
gene p53 has been extensively studied and it is known 
to be involved in various well-known biological path- 
ways. It has been observed that p53 is mutated in 30- 
50% of common human cancers [3]. In fact, in the pre- 
sence of DNA damage, a mutant p53 may lead to the 
emergence of abnormal cells. Figure 3 shows the ATM- 
p53-Wipl-Mdm2 pathways that involve the tumor sup- 
pressor gene p53 [18]. 



These pathways operate in different ways depending 
on the context, determined by the presence (or absence) 
of a DNA damage event that results in DNA double- 
strand breaks. Here, we consider the case when DNA 
damage is present, which may lead to the development 
and progression of tumor. Under this context, we con- 
sider single and double mutations in the given pathways, 
where we focus on the mutation of p53 and Mdrn2. 
Each gene alteration can be one of the three forms: 
mutation, amplification, or deletion. Sequencing data of 
138 patients with glioblastoma, provided by TCGA, 
showed that 32% and 12% of the patients suffered from 
the alteration in the pS3 and Mdm2 genes, respectively. 
Also among 316 patients with serous ovarian cancer, 
96% suffered from the mutation of p53. A similar study 
has revealed that about 26% of 216 patients with sar- 
coma have amplified Mdm2. Mutation in p53 and 
amplification in Mdm2 have been also simultaneously 
observed in some cases. Based on these observations 
made in existing cancer studies, we consider the follow- 
ing types of alterations in our experiments: (j?53,0), 
(Mdm2,l), and {(p53, 0), (Mdm2, 1)}, where p53 is per- 
manently deactivated and/or Mdm2 is permanently acti- 
vated. In a recent work [1], it has been shown that the 
pathways in Figure 3 do not uniquely determine the 
normal Boolean network N normai that governs healthy 
cells. We used the method proposed in [1] to enumerate 
all possible Boolean networks that satisfy the constraints 
imposed by the given pathways. Following [1], we con- 
structed four Karnaugh maps, one for each gene in the 
given pathways. Karnaugh maps have been used in logic 



Table 3 Performance of the proposed algorithm evaluated on 100 randomly generated network families 



\BAf 


= 256 ,Pcancer = P = 0.001, ft 


= 0.1 


P1/P2 


P miss 


pemp 

miss 


pemp 
miss, e=0.1 


Ct=0.2 AVG0f 


BAf 2 


AVG of # of paths AVG of # of SSD calculations 



P1 = 0.1, p 2 = 0.1 0.95 0.74 0.68 0.68 28.6 8.74 8,158 

P1 = 0.3, p 2 = 0.3 0.66 0.42 0.36 0.34 57.5 10.2 13,999 

P] = 0.5, p 2 = 0.5 0.34 0.1 6 0.1 3 0.1 3 1 1 1 .3 1 3.04 35,039 

P] = 0.7, p 2 = 0.7 0.1 1 0.05 0.03 0.03 123.1 1 1.45 89,788 



Performance of the proposed algorithm evaluated on 100 randomly generated network families. Each family contained BAf = 256 8-gene networks {k = 
2, M = 3, and p b = 0.3). 
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Table 4 Performance of the proposed algorithm in case of model mismatch. Evaluated on 500 randomly generated 
network families 









\BM | = 256 


, Pcancer= 0.001, p = 


0.003, ft = 0.1 




Pi 


pemp 
1 miss 


1 rai55,e=0.1 


pemp 

1 miss, e =0.2 


AVG of 


BM 2 


AVG of # of paths 


AVG of # of SSD calculations 


Pi = 0.1 
Pi = 0.3 
Pi = 0.5 
Pi = 0.7 


0.88 
0.49 
0.25 
0.18 


0.77 
0.43 
0.20 
0.15 


0.76 
0.42 
0.19 
0.14 


8.65 
46.8 
61.31 
77.82 


2.3 
4.29 
4.35 
4.41 


3177.1 
4693.9 
5802.1 
6870.7 



Performance of the proposed algorithm in case of model mismatch. Evaluated on 500 randomly generated network families. Each family contained | BAT | = 256 
6-gene networks {k = 2, M = 2, and p b = 0.3). 



circuit design to simplify a given Boolean function and 
derive its minimal representation. In a Karnaugh Map 
[19], each position in the map (i.e., an element in a 
matrix) corresponds to a specific state (defined by the 
values of all genes in the network), such that neighbor- 
ing states have unit Hamming distance. The value at 
each position indicates the value of a particular gene at 
the next time point, which is a Boolean function of the 
current state. The resulting maps are shown in Figure 4. 
In these tables, each line-segment, attached to a gene, 
shows the locations where that gene the takes value 1. 
The symbol X is used to indicate positions where the 
available pathway information was not enough for 
uniquely determining the table entries. These entries 
may take either 0 or 1 without violating the constraints. 
As a result, the given Karnaugh maps give rise to an 
uncertainty class of networks BM that contains 2 12 , 
where 12 is the number of entries in the given maps 
that cannot be uniquely determined. Since Mdm2 is 
directly connected to three genes in Figure 3, we assume 
the connectivity to be k = 3, which is used to estimate 
the CDF of the distance between a random BNp and its 
altered version. The BN reported in [1] is assumed to be 
the true normal network N normab as this network was 
shown to faithfully reproduce the experimentally 
observed behavior of the genes in published literature. 
We assumed p cancer to be 0.001. As in the previous sec- 
tion, we evaluated the performance of the proposed 
algorithm under two different cases: when we have a 



perfect estimate of the perturbation probability (p = 
Pcancer) and when there is a model mismatch (p * 

Pcancer)' 

Case-1 : deactiviation of p53 

We considered the alteration of the Boolean network 
reported in [1] through the permanent deactivation of 
/?53 (i.e. (^>53, 0)). We used our algorithm to detect the 
true normal network and the gene mutation. Table 7 
shows the simulation result when the threshold was set 
to /?! = 0.05. The second column in the table shows the 
number of networks in the final network family, and the 
third column shows the total number of network-path 
pairs predicted by the algorithm. The fourth column 
shows the number of different paths in the predicted 
result. We also categorized the result of each experi- 
ment as a "success (S)" or a "failure (F)", based on 
whether the final prediction contained the true network- 
path pair or not. As we can see, our algorithm was able 
to reduce the uncertainty class of networks without 
missing the true network for p = 0.001,0.003,0.005. For 
p = 0.007, the algorithm missed the true network, 
mainly because the perturbation probability was high 
enough to render the effects of the regulatory structure 
of the network relatively insignificant. Increasing f} 1 
from 0.05 to 0.1 increases the number of network-path 
pairs included in the final prediction, thereby preventing 
the algorithm from missing the true network, as shown 
in Table 8. In terms of fault detection, the proposed 



Table 5 Performance of the proposed algorithm in case of model mismatch. Evaluated on 200 randomly generated 
network families 





BM =1024 


, Pcancer = 0.001, jD = 


0.003, ft = 0.1 




Pi 


pemp 

miss 


pemp 
miss, .g =0.1 


pemp 

1 miss,<==0.2 


AVG of 


BN 2 


AVG of # of paths 


AVG of # of SSD calculations 



Pi = 0.1 0.94 0.82 0.80 14.33 2.52 12,454 

Pt= 0.3 0.43 0.37 0.36 140.5 4.9 17,850 

= 0.5 0.28 0.22 0.21 172.48 4.54 21,175 

Pi=0.7 0.19 0.13 0.13 234.8 4.60 26,060 



Performance of the proposed algorithm in case of model mismatch. Evaluated on 200 randomly generated network families. Each family contained | BM | = 1024 , 
024 6-gene networks {k = 2, M = 2, and p b = 0.3). 
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Table 6 Performance of the proposed algorithm in case of model mismatch. Evaluated on 100 randomly generated 
network families 



BM = 256 , Pcancer = 0.001, p = 0.003, ft = 0.1 


P^Pl 


pemp 
1 miss 


pemp p emp AVG of 
mi55,e=0.1 mi55,G=0.2 


BM 2 


AVG of # of paths 


AVG of # of SSD calculations 



Pl = 0.1, p 2 = 0.1 0.95 0.76 0.75 25.75 7.88 5,724 

P1 = 0.3, p 2 = 0.3 0.48 0.44 0.43 46 10.79 12,720 

P1 = 0.5, p 2 = 0.5 0.21 0.17 0.16 97.8 14.34 30,146 

P] = 0.7, p 2 = 0.7 0.19 0.14 0.14 100.8 9.69 47,755 



Performance of the proposed algorithm in case of model mismatch. Evaluated on 100 randomly generated network families. Each family contained | BM | = 256 
8-gene networks {k = 2, M = 3, and p b = 0.3). 



algorithm performed very well As shown in Table 7 and 
Table 8, the algorithm was able to correctly pinpoint the 
actual gene mutation out of 8 possible mutations, when 
it was successful. 

Case-2: amplification of Mdm2 

Next, we altered the normal network by mutating 
Mdm2 such that it is amplified (i.e. (Mdm2,l)). The 
results are summarized in Table 9 and Table 10 for fi 1 
= 0.05 and /5 1 = 0.1, respectively. For f$ Xl our algorithm 
was able to reduce the uncertainty class of networks 
without missing the true normal network for p - 0.001 
and p = 0.003. When the perturbation probability 
became larger, the regulatory structure from the path- 
ways was obscured, and the algorithm was not able to 
effectively reduce the uncertainty class (e.g., see Table 9, 
p = 0.005 and p = 0.007). By increasing p 1 from 0.05 to 



DNADSBs 



I 




Figure 3 ATM-p53-Wip1-Mdm2 pathways. 

\ / 



0.1, the algorithm could successfully reduce the uncer- 
tainty class for p - 0.005, as shown in Table 10. 

Case-3: simultaneous deactivation of p53 and 
amplification of Mdm2 

Finally, we considered the case when j?53 was deactivated 
and Mdm2 was amplified at the same time. Table 11 and 
Table 12 summarize the results of applying the proposed 
algorithm for the case of double gene mutations: (^53,0) 
and (Mdrn2,l). As we would expect, the proposed algo- 
rithm did not perform well in this case, since introducing 
two gene mutations in a 4-gene network almost comple- 
tely obscures the regulatory structure in the original nor- 
mal network. In fact, the networks in the initial 
uncertainty class BM will yield similar (or identical) 
SSDs once we mutate two genes. These results lead to an 
interesting insight into the expected performance of the 
proposed algorithm. As mentioned throughout the paper, 
the proposed algorithm aims to backtrace the set of gene 
mutations that has led to an (unknown) cancerous gene 
regulatory network with a given SSD. Suppose the num- 
ber of mutations M is relatively small compared to the 
total number of genes n in the network (e.g., Mln ~ 0). In 
such a case, the dynamics of the cancerous network 
would be largely governed by the regulatory mechanisms 
in the original healthy network. Even though it is theore- 
tically possible that a few gene alterations lead to signifi- 
cant changes in the overall SSD, identifying these 
alterations would be still feasible since the regulatory 
structure of the original network would remain mostly 
intact. However, if the number of mutations gets larger 
(e.g., Mln ~ 1), the activity of many genes would be 
"frozen", either being permanently deactivated or perma- 
nently amplified, in which case the dynamics and the reg- 
ulatory structure of the original network would be 
significantly lost. As a result, networks that originally 
have very distinct structures may yield similar SSDs as a 
result of the accrued mutations. In this case, it would be 
difficult for the algorithm to make predictions with high 
accuracy, since the available information would be too 
small to effectively cope with the present uncertainty. 
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Figure 4 Karnaugh maps generated by the pathways shown in Figure 3. 
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Conclusions 

We proposed an effective probabilistic algorithm for 
reconstructing the tumor progression process. Given an 
uncertainty class of networks, which arises from our 
partial knowledge of the true gene regulatory network 
represented by biological pathways, and the steady state 
distribution of a cancerous network, the proposed algo- 
rithm tries to simultaneously infer the true gene regula- 
tory network that underlies healthy cells and to predict 
the sequence of gene mutations that occurred during 
the tumor progression process. As demonstrated by our 
experiments, based on both randomly generated 



networks and realistic networks constructed from 
known biological pathways that involve the tumor sup- 
pressor gene p53, our algorithm can effectively cope 
with the uncertainty present in gene regulatory networks 
and accurately infer the normal (healthy) network and 
the actual path of tumor progression with high probabil- 
ity. Furthermore, the proposed algorithm is robust to 
model mismatch and provides us with effective means 
for trading prediction accuracy for computational 
efficiency. 

The computational complexity of the algorithm depends 
on the number of genes in the network, the number of 



Table 7 Performance of the proposed algorithm in the 
case when p53 is deactivated 



# networks # network-path pairs # paths 



result 



Table 8 Performance of the proposed algorithm in the 
case when p53 is deactivated 

p # networks # network-path pairs # paths result 



0.001 


2048 


2048 


1 S 


0.001 


2048 


2048 


1 S 


0.003 


2048 


2048 


1 S 


0.003 


2048 


2048 


1 S 


0.005 


512 


512 


1 S 


0.005 


1904 


1904 


1 S 


0.007 


0 


0 


0 F 


0.007 


832 


832 


1 S 



Performance of the proposed algorithm in the case when p53 is deactivated. 
Threshold was set to = 0.05 and the true perturbation probability was 
assumed to be (p ca ncer = 0.001). 



Performance of the proposed algorithm in the case when p53 is deactivated. 
Threshold was set to = 0.1 and the true perturbation probability was 
assumed to be ip ca ncer = 0.001). 
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Table 9 Performance of the proposed algorithm in the 
case when Mdm2 is amplified 



p # networks # network-path pairs # paths result 



0.001 1088 


1174 


3 S 


0.003 520 


540 


2 S 


0.005 0 


0 


0 F 


0.007 0 


0 


0 F 


Performance of the proposed algorithm in the case when Mdml is amplified. 


Threshold was set to 


= 0.05 and the true perturbation 


probability was 


assumed to be {p cancer = 


0.001). 




Table 10 Performance of the proposed algorithm in the 


case when Mdm2 is amplified 




p # networks 


# network-path pairs 


# paths result 


0.001 1088 


1184 


3 S 


0.003 832 


894 


3 S 


0.005 520 


544 


2 S 


0.007 0 


0 


0 F 


Performance of the proposed algorithm in the case when Mdml is amplified. 


Threshold was set to 


=0.10 and the true perturbation probability was 


assumed to be (p can cer = 


0.001). 





Table 1 1 Performance of the proposed algorithm when 



p53 is deactivated and Mdm2 is amplified 




# 

networks 


# network-path 
pairs 


# 

paths 


result 


# SSD 
calculations 


Pi = 
0.1 


730 


1333 


4 


F 


38,684 


Pi = 

0.3 


2458 


3810 


4 


F 


69,050 


Pi = 
0.5 


3937 


7208 


4 


S 


93,650 


Pi = 
0.7 


4096 


9009 


4 


S 


113,612 


Performance of the proposed algorithm when p53 is deactivated and Mdml is 
amplified. Several different values of jSq was used (by varying p{), and p 2 was set to 
0.1 . The perturbation probability was assumed to be known (p = p cancer = 0.001 ). 


Table 12 Performance of the proposed algorithm when 
p53 is deactivated and Mdm2 is amplified 


Pi 


# 

networks 


# network-path 
pairs 


# 

paths 


result 


# SSD 
calculations 


Pi = 
0.1 


1063 


1629 


4 


F 


40,016 


Pi = 

0.3 


1661 


2582 


4 


F 


48,038 


Pi = 
0.5 


2704 


4089 


4 


F 


69,146 


Pi = 


4096 


8839 


4 


S 


101,426 



0.7 

Performance of the proposed algorithm when p53 is deactivated and Mdml is 
amplified. Several different values of was used (by varying p y ), and p 2 was 
set to 0.1. We assumed that the true perturbation probability is unknown, 
hence there is a model mismatch (p = 0.003, p cancer = 0.001). 



mutations, and the number of networks in the initial 
uncertainty class, and increasing any of these numbers will 
increase the computational overhead. Based on the mathe- 
matical representation of Boolean networks, increasing the 
number of genes will exponentially increase the number of 
possible networks. However, this rapid increase does not 
necessarily mean that the size of the uncertainty class of 
networks that we need to deal with will increase at the 
same rate. For example, many of the mathematically possi- 
ble networks may not be considered biologically viable, 
hence may be omitted in practice. Moreover, although the 
total number of states in a Boolean network with n genes 
is 2 W , many states may be eliminated via state reduction, 
and the reduced network may consist of considerably 
fewer states [20]. In fact, the whole idea of network reduc- 
tion is relevant to the present problem, just as it is to 
determining control polices for gene regulatory networks, 
where computational intractability prohibits the design of 
control policies without constraining the network size 
[21]. For example, even when the gene expression levels 
are restricted to be binary, a network with 15 genes, absent 
some form of state reduction, cannot be considered, 
because the size of the resulting transition probability 
matrix would be 2 15 x 2 15 , making any kind of dynamic or 
control analysis intractable. Another possible way to 
reduce the complexity of the algorithm is to restrict the 
possible gene mutations via the use of prior knowledge. 
For example, we may restrict the possible mutant genes 
only to a smaller subset of genes that are known to be sus- 
ceptible to mutation. Furthermore, prior knowledge con- 
cerning the expected type of mutation for a susceptible 
gene (e.g., "amplification" for oncogenes and "deactivation" 
for tumor suppressor genes) can be taken into account. 
Although we did not constrain the possible gene muta- 
tions nor applied any network reduction technique in this 
study, such modifications are fairly straightforward and 
may be used to enhance the overall computational effi- 
ciency of the proposed algorithm. 
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